ATOM Documentation

← Back to App

PostgreSQL Graph Architecture Proposal

1. Executive Summary

We propose shifting the GraphRAGEngine storage backend from **LanceDB (Edge List)** to **PostgreSQL (Relational Graph)**.

**Why?**

  • **Statelessness**: Allows utilizing atomic SQL queries (Recursive CTEs) for traversal without loading the entire graph into RAM.
  • **Infrastructure**: Reuses the existing PostgreSQL instance. No new infrastructure (Embedded Graph) or sticky volumes required.
  • **Reliability**: ACID compliance for graph updates.

2. Proposed Schema

We will introduce 3 normalized tables to represent the property graph structure.

`graph_nodes`

Represents entities (People, Projects, Documents).

CREATE TABLE graph_nodes (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    name VARCHAR(512) NOT NULL,
    type VARCHAR(50) NOT NULL, -- e.g., 'person', 'task'
    description TEXT,
    properties JSONB DEFAULT '{}', -- Flexible metadata
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    
    -- Fast lookup
    CONSTRAINT uniq_node_workspace_name UNIQUE (workspace_id, name, type)
);
CREATE INDEX idx_nodes_workspace ON graph_nodes(workspace_id);

`graph_edges`

Represents relationships.

CREATE TABLE graph_edges (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    source_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    target_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    relationship_type VARCHAR(50) NOT NULL, -- e.g., 'manages', 'blocks'
    weight FLOAT DEFAULT 1.0,
    properties JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_edges_source ON graph_edges(source_node_id);
CREATE INDEX idx_edges_target ON graph_edges(target_node_id);

`graph_communities` (For GraphRAG)

Stores the hierarchical clusters detected by the Leiden algorithm.

CREATE TABLE graph_communities (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    level INT NOT NULL, -- Hierarchy level (0, 1, 2)
    summary TEXT NOT NULL, -- LLM-generated summary
    keywords JSONB, -- List of top keywords
    created_at TIMESTAMP DEFAULT NOW()
);

`community_membership`

Maps nodes to communities.

CREATE TABLE community_membership (
    community_id UUID REFERENCES graph_communities(id) ON DELETE CASCADE,
    node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    PRIMARY KEY (community_id, node_id)
);

3. Workflow & Logic

A. Ingestion (Write)

When a document is processed:

  1. **Extract**: LLM extracts (Entity A, Relation, Entity B).
  2. **Upsert Nodes**: Insert Entity A and Entity B into graph_nodes (ON CONFLICT DO NOTHING).
  3. **Insert Edges**: Insert the relationship into graph_edges.

B. Retrieval (Read) - "Stateless Graph"

Instead of hydrate_workspace (loading 10k nodes into RAM), we use **Recursive SQL**.

**Local Search (Traversal):**

  • *Question*: "What does Sarah work on?"
  • *SQL*: Find Node "Sarah" -> JOIN graph_edges -> JOIN graph_nodes (Target).
  • *Depth*: Efficiently handled by Postgres for depths 1-3.
-- Example: Find everything Sarah touches (Depth 1)
SELECT target.name, edge.relationship_type
FROM graph_nodes source
JOIN graph_edges edge ON source.id = edge.source_node_id
JOIN graph_nodes target ON edge.target_node_id = target.id
WHERE source.name = 'Sarah' AND source.workspace_id = 'ws_123';

C. Community Detection (Async Job)

Since we can't run Leiden/Louvain efficiently in SQL:

  1. **Job**: Periodically (e.g., hourly), a worker process fetches the full graph for a workspace.
  2. **Compute**: Loads into NetworkX (RAM), runs Leiden algorithm.
  3. **Summarize**: Generates summaries via LLM.
  4. **Save**: Writes results to graph_communities table.
  5. **Free RAM**: Process exits or clears memory.

4. Pros vs Cons

FeaturePostgres ArchitectureLanceDB/In-Memory (Legacy)
**Persistence**✅ Excellent (RDS)❌ Poor (Volatile RAM)
**Memory Usage**✅ **Low** (O(1) per query)❌ High (O(N) per workspace)
**Traversal Speed**⚠️ Medium (SQL Joins)✅ Instant (In-Memory)
**Complexity**✅ Low (Standard SQL)⚠️ Medium (Custom Engine)
**Vector Search**✅ Native (pgvector)✅ Native (LanceDB)

*Note: We will KEEP LanceDB for pure text chunk vector search. Postgres is purely for the structural Metadata Graph.*

5. Migration Strategy

  1. **Definition**: Create SQLAlchemy models in backend-saas/core/models.py.
  2. **Migration**: Generate Alembic migration scripts.
  3. **Refactor**: Update GraphRAGEngine to read/write from SQL models instead of internal dicts.